Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docs: Eradicate unpaged SELECTs and warn about their drawbacks #1068

Merged

Conversation

wprzytula
Copy link
Collaborator

@wprzytula wprzytula commented Aug 27, 2024

Follow-up of #1061.

Motivation

Unpaged SELECTs should be discouraged due to the following drawbacks:

However, query_unpaged will return all results in one, possibly giant, piece
(unless a timeout occurs due to high load incurred by the cluster).
This:

  • increases latency,
  • has large memory footprint,
  • puts high load on the cluster,
  • is more likely to time out (because big work takes more time than little work,
    and returning one large piece of data is more work than returning one chunk of data).

What's done

  • the above caveat is added in the documentation book,
  • existing examples are modified to use query_iter instead,
  • giant bonus: a broad overview over CQL statement kinds is added, together with best practices wrt preparing, batching and paging.

Note to reviewers

Please make sure (by reading both crate docs and the docs book) that in all places it is clear enough that SELECTs by default should be done using query_iter rather than query_unpaged, and why so.

Pre-review checklist

  • I have split my patch into logically separate commits.
  • All commit messages clearly explain what they change and why.
  • [ ] I added relevant tests for new features and bug fixes.
  • All commits compile, pass static checks and pass test.
  • PR description sums up the changes and reasons why they should be introduced.
  • [ ] I have provided docstrings for the public items that I want to introduce.
  • I have adjusted the documentation in ./docs/source/.
  • [ ] I added appropriate Fixes: annotations to PR description.

In paged.md such formulation was already rephrased in PR with paging API
refactor.
We shouldn't encourage users to perform unpaged SELECTs. Normally,
SELECTs should be paged, therefore the whole doc series on CQL data
types is modified to use query_iter instead of query_unpaged.
We shouldn't encourage users to perform unpaged SELECTs. Normally,
SELECTs should be paged, therefore the example in README is modified
to use query_iter instead of query_unpaged.
We shouldn't encourage users to perform unpaged SELECTs. Normally,
SELECTs should be paged, therefore the quickstart example is modified
to use query_iter instead of query_unpaged.
@wprzytula wprzytula requested a review from fee-mendes August 27, 2024 15:04
@wprzytula wprzytula self-assigned this Aug 27, 2024
@wprzytula wprzytula added area/documentation Improvements or additions to documentation area/statement-execution labels Aug 27, 2024
@wprzytula wprzytula added this to the 0.14.0 milestone Aug 27, 2024
Copy link

github-actions bot commented Aug 27, 2024

cargo semver-checks found no API-breaking changes in this PR! 🎉🥳
Checked commit: 8604a33

@wprzytula
Copy link
Collaborator Author

NOTE: examples in examples/*.rs are still not adjusted this way. I suggest a separate PR for them.

Copy link
Member

@fee-mendes fee-mendes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general it LGTM.

One thing I just realized is that paged.md still contains the old performance disclaimer. I thought query_iter vs the new execute_unpaged had their own caveats.

Either way, I guess it is fine to follow-up somewhere else and it shouldn't block the existing work.

@muzarski
Copy link
Contributor

In docs/execution-profiles/priority.md:

let mut query = Query::from("SELECT * FROM ks.table");

// Query is not assigned any specific profile, so session's profile is applied.
// Therefore, the query will be executed with Consistency::One.
session.query_unpaged(query.clone(), ()).await?;

query.set_execution_profile_handle(Some(query_profile.into_handle()));
// Query's profile is applied.
// Therefore, the query will be executed with Consistency::Two.
session.query_unpaged(query.clone(), ()).await?;

query.set_consistency(Consistency::Three);
// An option is set directly on the query.
// Therefore, the query will be executed with Consistency::Three.
session.query_unpaged(query, ()).await?;

I'd use Session::query_single_page here.

Copy link
Contributor

@muzarski muzarski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But otherwise, LGTM

@wprzytula
Copy link
Collaborator Author

In docs/execution-profiles/priority.md:

let mut query = Query::from("SELECT * FROM ks.table");

// Query is not assigned any specific profile, so session's profile is applied.
// Therefore, the query will be executed with Consistency::One.
session.query_unpaged(query.clone(), ()).await?;

query.set_execution_profile_handle(Some(query_profile.into_handle()));
// Query's profile is applied.
// Therefore, the query will be executed with Consistency::Two.
session.query_unpaged(query.clone(), ()).await?;

query.set_consistency(Consistency::Three);
// An option is set directly on the query.
// Therefore, the query will be executed with Consistency::Three.
session.query_unpaged(query, ()).await?;

I'd use Session::query_single_page here.

Wouldn't query_single_page (comparing to query_iter) introduce more noise, as it involves manual passing paging state?

@wprzytula
Copy link
Collaborator Author

wprzytula commented Aug 28, 2024

A general thought:

Now all SELECTs are recommended to be done using {query,execute}_iter. This means that all SELECTs would introduce considerable overhead of using the RowIteratorWorker, even if the SELECT only returns a single row. This isn't optimal.

However, let's stick to this approach, because it guarantees correctness wrt paging. In GRER (Giant Request Execution Refactor) we'll think about redesigning iteration over pages so that it's not that expensive, so by 0.17 we should have this solved.
WDYT @fee-mendes @Lorak-mmk @muzarski ?

@fee-mendes
Copy link
Member

fee-mendes commented Aug 28, 2024

I think it is fine. As long as we clarify the Performance section in paged.md as I pointed out earlier and discuss these points in depth there.

Probably that's how you want to introduce the paged queries doc

@wprzytula
Copy link
Collaborator Author

One thing I just realized is that paged.md still contains the old performance disclaimer. I thought query_iter vs the new execute_unpaged had their own caveats.

What disclaimer do you have in mind? All that I can see seem to be up-to-date.

@muzarski
Copy link
Contributor

In docs/execution-profiles/priority.md:

let mut query = Query::from("SELECT * FROM ks.table");

// Query is not assigned any specific profile, so session's profile is applied.
// Therefore, the query will be executed with Consistency::One.
session.query_unpaged(query.clone(), ()).await?;

query.set_execution_profile_handle(Some(query_profile.into_handle()));
// Query's profile is applied.
// Therefore, the query will be executed with Consistency::Two.
session.query_unpaged(query.clone(), ()).await?;

query.set_consistency(Consistency::Three);
// An option is set directly on the query.
// Therefore, the query will be executed with Consistency::Three.
session.query_unpaged(query, ()).await?;

I'd use Session::query_single_page here.

Wouldn't query_single_page (comparing to query_iter) introduce more noise, as it involves manual passing paging state?

That's true. My only concern was that we use query_unpaged for a SELECT * statement. But this doc wasn't actually created to showcase the query_unpaged/query_iter differences (the result is not even used here), so I think it's fine to leave it as query_unpaged.

@muzarski
Copy link
Contributor

muzarski commented Aug 28, 2024

A general thought:

Now all SELECTs are recommended to be done using {query,execute}_iter. This means that all SELECTs would introduce considerable overhead of using the RowIteratorWorker, even if the SELECT only returns a single row. This isn't optimal.

However, let's stick to this approach, because it guarantees correctness wrt paging. In GRER (Giant Request Execution Refactor) we'll think about redesigning iteration over pages so that it's not that expensive, so by 0.17 we should have this solved. WDYT @fee-mendes @Lorak-mmk @muzarski ?

In examples refactor (that I'm currently working on), I decided to make use of query_single_page when we are sure that result will consist of a single row (or when we want to showcase the QueryResult API).

@wprzytula
Copy link
Collaborator Author

A general thought:
Now all SELECTs are recommended to be done using {query,execute}_iter. This means that all SELECTs would introduce considerable overhead of using the RowIteratorWorker, even if the SELECT only returns a single row. This isn't optimal.
However, let's stick to this approach, because it guarantees correctness wrt paging. In GRER (Giant Request Execution Refactor) we'll think about redesigning iteration over pages so that it's not that expensive, so by 0.17 we should have this solved. WDYT @fee-mendes @Lorak-mmk @muzarski ?

In examples refactor (that I'm currently working on), I decided to make use of query_single_page when we are sure that result will consist of a single row (or when we want to showcase the QueryResult API).

Remember that even if the result is going to consist of a single row, in some situations (e.g. when there are a lot of tombstones - @fee-mendes please confirm) there can be a number of empty pages received before we finally get one with our expected row.

@fee-mendes
Copy link
Member

Remember that even if the result is going to consist of a single row, in some situations (e.g. when there are a lot of tombstones - @fee-mendes please confirm) there can be a number of empty pages received before we finally get one with our expected row.

Correct. If the query is paged, empty replica pages will kick in. If it is unpaged, then you retrieve the full result set.

In general, it is hard to guarantee a SELECT will return only a single row, or a single page. LIMIT 1, aggregations (COUNT()), and key-value tables are probably the only exceptions.

@fee-mendes
Copy link
Member

fee-mendes commented Aug 28, 2024

One thing I just realized is that paged.md still contains the old performance disclaimer. I thought query_iter vs the new execute_unpaged had their own caveats.

What disclaimer do you have in mind? All that I can see seem to be up-to-date.

I am reading paged.md which says:

### Performance
Performance is the same as in non-paged variants.\
For the best performance use [prepared queries](prepared.md).

Well, this doesn't seem to be accurate. In fact, it was never accurate.

  • It doesn't specify which is performant (query_single_page or *_iter or both?)
  • Under which circumstances unpaged or query_single_page are more performant than *_iter?
  • It doesn't explain why nor when one should choose query_single_page vs *_iter
  • It leads one to believe that they can use non-paged queries without any discrimination - as we say "performance is the same" - which is never true;

Maybe rather than talking about Performance, we should just recommend what's better instead. Always prefer paged queries, from the driver's perspective use X as it performs better due to A, B, C. We also provide Y, which is slightly less performant on the client-side given D, E, F.

// some warning disclaimer on unpaged //

@wprzytula
Copy link
Collaborator Author

I am reading paged.md which says:

### Performance
Performance is the same as in non-paged variants.\
For the best performance use [prepared queries](prepared.md).

Right. This particular statement was never true wrt {query,execute}_iter, which are always at least a bit less performant than {query,execute}_{single_page,unpaged}. I'll correct those recommendations as you suggest.

@fee-mendes
Copy link
Member

Right. This particular statement was never true wrt {query,execute}_iter, which are always at least a bit less performant than {query,execute}_{single_page,unpaged}. I'll correct those recommendations as you suggest.

Well, be mindful though - that performance is subjective even in the {query,execute}_{single_page,unpaged} case. I'd argue that single_page is always more performant than unpaged, as it requires less memory footprint given that unpaged queries may return a large QueryResult when scanning large keys, not to mention the pressure on the server-side.

Hence why I think the focus should shift from "Performance" to our recommended "Best Practices", with both client/server in mind.

Comment on lines 17 to 21

> To sum up, for SELECTs that may return return a lot of data prefer paged queries,
> e.g. with `Session::query_iter()` (see [Paged queries](paged.md)).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe remove a remark about a lot of data because of the tombstone problem?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or maybe rephase it as (especially those that may return a lot of data)?

@wprzytula
Copy link
Collaborator Author

wprzytula commented Aug 29, 2024

New commit addressing @fee-mendes' comments: giant overview of statements and methods of executing them, together with best practices.

@wprzytula wprzytula force-pushed the eradicate-unpaged-selects-from-docs branch from 349bc79 to 5b39d8f Compare August 29, 2024 11:14
@wprzytula
Copy link
Collaborator Author

wprzytula commented Aug 29, 2024

Addressed Karol's comments.

| Cluster load | potentially **HIGH** for large results, beware! | normal | normal |
| Driver overhead | low - simple frame fetch | low - simple frame fetch | considerable - `RowIteratorWorker` is a separate tokio task |
| Feature limitations | none | none | speculative execution not supported |
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if we should print warning / erorr query if speculative execution is set on execution profile used in {query ,execute}_iter.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, that might be a good idea, but OTOH forcing using different profiles for iter and non-iter execution and thus being inconvenient (for speculative execution users).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I worry about situation where user wants to use speculative execution, but silently doesn't because of _iter methods. You raised good point, I'm not sure how to best handle this possible issue.

Comment on lines +233 to +234
| Suitable operations | - in general: operations with empty result set (non-SELECTs)</br> - as possible optimisation: SELECTs with LIMIT clause | - for advanced users who prefer more control over paging, with less overhead of `RowIteratorWorker` | - in general: all SELECTs |
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't SELECT with limit still susceptible to tombstones problem?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines +70 to +87
| Suitable operations | - in general: operations with empty result set (non-SELECTs)</br> - as possible optimisation: SELECTs with LIMIT clause | - in general: all SELECTs |

For more detailed comparison and more best practices, see [doc page about paging](paged.md).
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto about LIMIT

We shouldn't encourage users to perform unpaged SELECTs. Normally,
SELECTs should be paged, therefore caveat about that is added in two
relevant places in docs.
This wasn't reverted when design regarding PageSize changed during code
review.
In order to avoid API misuse, much knowledge is now shared in a
structured way of tables, and best practices are described to aid users.
@wprzytula wprzytula force-pushed the eradicate-unpaged-selects-from-docs branch from 5b39d8f to 8604a33 Compare August 29, 2024 13:20
@wprzytula
Copy link
Collaborator Author

wprzytula commented Aug 29, 2024

Addressed another round of @Lorak-mmk and @muzarski comments.

@dkropachev dkropachev merged commit edc2ae2 into scylladb:main Sep 4, 2024
12 checks passed
@wprzytula wprzytula deleted the eradicate-unpaged-selects-from-docs branch September 4, 2024 20:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/documentation Improvements or additions to documentation area/statement-execution
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants